Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix GCI mounter issue #38124

Merged
merged 2 commits into from
Dec 7, 2016
Merged

Fix GCI mounter issue #38124

merged 2 commits into from
Dec 7, 2016

Conversation

jingxu97
Copy link
Contributor

@jingxu97 jingxu97 commented Dec 5, 2016

Two fixes in this PR

  1. Remove break in the mounter script to make sure gc run multiple times
    To make sure existed container created by mounter script gets garbage collected, we make rkt gc run multiple times

  2. Fix unmount issue cased by GlusterFS mounter
    When GlusterFS volume is mounted on GCI image cluster, the GCI mounter script will create a container and then execute mount in the container. Because mount for GlusterFS will run a glusterfs daemon, the container created by the mounter will continue be in running state and rkt garbage collection will not collect it. Because of the shared mount propagation setting, any mounts on the host will be reflected to the container which might cause a large mount table. Also, in current umount device logic, we check the mount references before issuing the operations. In this case, unmount device will fail because of the extra mount points in the container. The following gives an example of original mount path and the shared mount point created in container.

a. original mount
/var/lib/kubelet/plugins/kubernetes.io/gce-pd/mounts/nfs-disk rw,relatime shared:133 - ext4 /dev/sdb rw,data=ordered
b. shared mount in container
/var/lib/rkt/pods/run/3496eadc-74d9-4349-9370-db7636f5f963/stage1/rootfs/opt/stage2/gci mounter/rootfs/var/lib/kubelet/plugins/kubernetes.io/gce-pd/mounts/nfs-disk rw,relatime shared:133 - ext4 /dev/sdb rw,data=ordered

The current workaround for this issue is: since these shared mounts are irrelevant to the original mounts, they should be not considered when checking the mount references. By comparing the mount paths, the shared mount path contains the full original mount path so they should be filtered out.

Remove break in the mounter script to make sure gc run multiple times
@k8s-reviewable
Copy link

This change is Reviewable

@jingxu97 jingxu97 added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Dec 5, 2016
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 5, 2016
@jingxu97 jingxu97 added this to the v1.5 milestone Dec 5, 2016
@k8s-github-robot k8s-github-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. release-note-label-needed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Dec 5, 2016
@jingxu97 jingxu97 changed the title Fix GCI mounter script to run garbage collection multiple times Fix GCI mounter issue Dec 5, 2016
@jingxu97 jingxu97 added sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Dec 5, 2016
// Find all references to the device.
var refs []string
if deviceName == "" {
glog.Warningf("could not determine device for path: %q", mountPath)
} else {
for i := range mps {
if mps[i].Device == deviceName && mps[i].Path != slTarget {
refs = append(refs, mps[i].Path)
mpPrefix := getPodsDirPrefix(mps[i].Path, options.DefaultKubeletPodsDirName)
if mpPrefix == podsDirPrefix {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My worry with this is that it may skip checks on non-rkt mounts as well. Let's see if we can minimize this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to filter out the mount refs based on string "gci-mounter" in the path

@k8s-ci-robot
Copy link
Contributor

Jenkins verification failed for commit 6cb0ae2. Full PR test history.

The magic incantation to run this job again is @k8s-bot verify test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-ci-robot
Copy link
Contributor

Jenkins GKE smoke e2e failed for commit a22f52e. Full PR test history.

The magic incantation to run this job again is @k8s-bot cvm gke e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@@ -130,14 +130,24 @@ func GetMountRefs(mounter Interface, mountPath string) ([]string, error) {
}
}

// TODO: this is a workaround for the unmount device issue caused by gci mounter.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a total hack. gci_mounter is a GKE specific deployment detail. We should not be leaking that logic into upstream.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed a few options, but they are all seem hacky...

@@ -31,7 +31,7 @@ function gc {
# Rkt pods end up creating new copies of mounts on the host. Hence it is ideal to clean them up right away.
attempt=0
until [ $attempt -ge 5 ]; do
${RKT_BINARY} gc --grace-period=0s &> /dev/null && break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you making this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

otherwise, it only run gc once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is that a problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when I test it, I think sometimes the container still in running state in the moment when gc is executed. Give a few try can give more chances to garbage collect the containers . I also thought it was intended to execute 5 times.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was only intended to run multiple times if any of the previous attempts fail. Blocking for 5 seconds doesn't sound like a good idea. Is it possible to detect incomplete gc and then retry?
Also why is GC of concern?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the container is not GC, it causes those extra mounts which makes

  1. mount table is very big if there are many mount points, which might be a performance concern since we list the mount points during unmount operations
  2. Without this PR's fix, unmount device will fail because of those unrelated mount references. (umount device function first check whether there are other mount references for the mount path and will fail the operation if there are some exist)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 might be of concern. Do we have any data to back up that concern?
I don't get 2. Why is umount failing if other shared recursive mounts exist? Is the umount syscall failing or is it our own custom logic?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jingxu97

it is our own logic to check the mount references.

Can't we fix that logic then? As long as there is no separate mount point with the same device (bind mount), we can safely unmount. Check the mount reference number in /proc/self/mountinfo to make that decision.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a mount appears in two locations via a shared mount then /proc/self/mountinfo will explicitly indicate that. Here is an example.

$ cat /proc/self/mountinfo
...
90 80 0:66 / /tmp/original/tmpfs rw,relatime **shared:2** - tmpfs tmpfs rw
103 101 0:66 / /tmp/shared/tmpfs rw,relatime **shared:2** - tmpfs tmpfs rw

Look at the mount propagation type followed by a propagation number. If the number is the same, then that mount can be ignored. A umount from any one of those paths will result in the mount disappearing from both locations.

$ sudo umount /tmp/original/tmpfs 
vishnuk@vishnuk-linux:~$ cat /proc/self/mountinfo
...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current solution used here is checking the mount path and the shared mount path created by the rkt container. If the shared mount path contains the original mount path, this will be filtered out during GetMountRef. See comments #38124 (comment)

@jingxu97
Copy link
Contributor Author

jingxu97 commented Dec 6, 2016

@k8s-bot cvm gke e2e test this

@jingxu97
Copy link
Contributor Author

jingxu97 commented Dec 6, 2016 via email

@saad-ali
Copy link
Member

saad-ali commented Dec 6, 2016

Please finalize on this PR ASAP, cutoff for 1.5 is Tuesday.

@vishh
Copy link
Contributor

vishh commented Dec 6, 2016

I'm not comfortable with this PR because of #38124 (comment)

@jingxu97
Copy link
Contributor Author

jingxu97 commented Dec 6, 2016 via email

@jingxu97
Copy link
Contributor Author

jingxu97 commented Dec 6, 2016

Updated the PR. PTAL.

@k8s-ci-robot
Copy link
Contributor

Jenkins unit/integration failed for commit 887cd40. Full PR test history.

The magic incantation to run this job again is @k8s-bot unit test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

this is a workaround for the unmount device issue caused by gci mounter. In GCI cluster, if gci mounter is used for mounting, the container started by mounter script will cause additional mounts created in the container. Since these mounts are irrelavant to the original mounts, they should be not considered when checking the mount references. By comparing the mount path prefix, those additional mounts can be filtered out.

Plan to work on better approach to solve this issue.
@saad-ali saad-ali added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Dec 6, 2016
@saad-ali
Copy link
Member

saad-ali commented Dec 6, 2016

@vishh Without this change when anyone uses GlusterFS via mounter, all other mounts will fail to unmount. That's pretty bad. We need this fixed for the 1.5 release ASAP. This is hacky and I'd like to have a cleaner solution, but it gets the job done. @jingxu97 please open an issue to revisit and fix this code with a more robust solution.

@saad-ali
Copy link
Member

saad-ali commented Dec 6, 2016

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 6, 2016
@saad-ali saad-ali removed the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Dec 6, 2016
@k8s-github-robot
Copy link

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot
Copy link

Automatic merge from submit-queue

@vishh
Copy link
Contributor

vishh commented Dec 7, 2016

I hope by merging this in a hurry, we did not introduce new bugs :)

@saad-ali
Copy link
Member

saad-ali commented Dec 7, 2016

+1 We'll let it bake overnight in master. If the world hasn't burned down, we'll cherry pick this in the AM (Wed morning PST).

@vishh
Copy link
Contributor

vishh commented Dec 7, 2016 via email

@thockin thockin deleted the Dec/gluster branch December 7, 2016 06:47
@saad-ali saad-ali added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Dec 7, 2016
k8s-github-robot pushed a commit that referenced this pull request Dec 7, 2016
…24-upstream-release-1.5

Automatic merge from submit-queue

Automated cherry pick of #38124

Cherry pick of #38124 on release-1.5.

#38124: Fix GCI mounter script to run garbage collection multiple
@k8s-cherrypick-bot
Copy link

Commit found in the "release-1.5" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.

jingxu97 added a commit to jingxu97/kubernetes that referenced this pull request Dec 8, 2016
This is a fix on top kubernetes#38124. In this fix, we move the logic to filter
out shared mount references into operation_executor's UnmountDevice
function to avoid this part is being used by other types volumes such as
rdb, azure etc. This filter function should be only needed during
unmount device for GCI image.
@jessfraz jessfraz added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Dec 8, 2016
k8s-github-robot pushed a commit that referenced this pull request Dec 8, 2016
Automatic merge from submit-queue

Fix unmountDevice issue caused by shared mount in GCI

This is a fix on top #38124. In this fix, we move the logic to filter
out shared mount references into operation_executor's UnmountDevice
function to avoid this part is being used by other types volumes such as
rdb, azure etc. This filter function should be only needed during
unmount device for GCI image.
jingxu97 added a commit to jingxu97/kubernetes that referenced this pull request Dec 8, 2016
This is a fix on top kubernetes#38124. In this fix, we move the logic to filter
out shared mount references into operation_executor's UnmountDevice
function to avoid this part is being used by other types volumes such as
rdb, azure etc. This filter function should be only needed during
unmount device for GCI image.
jingxu97 added a commit to jingxu97/kubernetes that referenced this pull request Dec 8, 2016
This is a fix on top kubernetes#38124. In this fix, we move the logic to filter
out shared mount references into operation_executor's UnmountDevice
function to avoid this part is being used by other types volumes such as
rdb, azure etc. This filter function should be only needed during
unmount device for GCI image.
jessfraz added a commit that referenced this pull request Dec 9, 2016
…24-upstream-release-1.4

Automated cherry pick of #38124
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants